graph LR
subgraph RAG["Standard RAG"]
A1["Query"] --> B1["Retrieve"] --> C1["Generate"] --> D1["Answer"]
end
subgraph DR["Deep Research Agent"]
A2["Query"] --> B2["Plan"]
B2 --> C2["Search<br/>Iteratively"]
C2 --> D2["Reflect:<br/>Gaps?"]
D2 -->|Yes| C2
D2 -->|No| E2["Triangulate<br/>Sources"]
E2 --> F2["Write<br/>Report"]
end
style RAG fill:#F2F2F2,stroke:#D9D9D9
style DR fill:#F2F2F2,stroke:#D9D9D9
style D1 fill:#e74c3c,color:#fff,stroke:#333
style F2 fill:#27ae60,color:#fff,stroke:#333
Deep Research Agents: from RAG to Autonomous Investigation
Iterative retrieval loops, web search integration, self-reflection, source triangulation, and automated report generation
Keywords: deep research agent, autonomous research, iterative retrieval, web search integration, self-reflection, source triangulation, report generation, STORM, GPT Researcher, LangGraph, multi-agent research, plan-and-execute, Tavily, agentic reasoning, knowledge synthesis

Introduction
Standard RAG retrieves a handful of chunks from a vector store and generates an answer in a single pass. That works for factual questions with clear answers — but it collapses when the task requires investigation: synthesizing information across dozens of sources, cross-checking claims, following citation chains, and producing a structured report with proper attribution.
Deep research agents close this gap. They extend the ReAct loop into a full research workflow: plan what to investigate, search iteratively across the web and local documents, reflect on whether findings are sufficient, triangulate claims across multiple sources, and compile everything into a comprehensive report.
OpenAI, Anthropic, Google, and Perplexity all ship deep research products. Open-source implementations — GPT Researcher, LangChain’s Open Deep Research, and STORM — demonstrate that the pattern is reproducible with commodity LLMs and search APIs.
This article covers the full architecture: from the limitations of single-pass RAG, through the core design patterns (iterative retrieval, self-reflection, source triangulation), to working implementations with LangGraph and LlamaIndex. We build a complete deep research agent from scratch, explore multi-agent research orchestration, and discuss production considerations.
Why Single-Pass RAG Is Not Enough
The Research Gap
Consider the query: “Compare the approaches to AI safety taken by leading labs and summarize the key disagreements.”
A standard RAG pipeline would:
- Embed the query
- Retrieve 5–10 chunks from a vector store
- Generate an answer from those chunks
The result is shallow — it can only reference whatever happens to be in the top-k results. Complex research tasks require fundamentally different behavior:
| Capability | Single-Pass RAG | Deep Research Agent |
|---|---|---|
| Source breadth | 5–10 chunks from one index | 20–100+ sources from web + local |
| Search strategy | One query, one retrieval | Iterative: refine queries based on findings |
| Cross-verification | None — trusts single source | Triangulates claims across multiple sources |
| Decomposition | None — single query | Breaks question into sub-questions |
| Self-reflection | None | Evaluates completeness, identifies gaps |
| Output format | Short answer | Structured report with citations |
| Time budget | Seconds | Minutes to tens of minutes |
When You Need Deep Research
Deep research agents are the right choice when:
- The question requires synthesis across multiple topics or perspectives
- Answers must be grounded in sources with proper citations
- The research scope is open-ended — you don’t know all the sub-topics in advance
- Accuracy matters more than speed — the user can wait minutes for a thorough report
- The output is a deliverable (report, briefing, comparison) rather than a quick answer
Core Design Patterns
Deep research agents combine five interlocking patterns. Each addresses a specific failure mode of simple retrieval.
Pattern 1: Iterative Retrieval Loops
Instead of one-shot retrieval, the agent searches multiple times, using each round’s results to refine subsequent queries. This is the difference between a student who reads the first Google result and one who follows citation chains.
graph TD
A["Research Question"] --> B["Generate<br/>Search Queries"]
B --> C["Execute<br/>Searches"]
C --> D["Extract<br/>Key Findings"]
D --> E{"Sufficient<br/>Coverage?"}
E -->|No| F["Generate<br/>Follow-up Queries"]
F --> C
E -->|Yes| G["Compile<br/>Findings"]
style A fill:#4a90d9,color:#fff,stroke:#333
style E fill:#f5a623,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
The key insight: later queries are informed by earlier results. If the first search reveals “RLHF” as a key concept, the agent generates a targeted follow-up query about RLHF — something the original query wouldn’t have surfaced.
from openai import OpenAI
from tavily import TavilyClient
client = OpenAI()
tavily = TavilyClient()
def iterative_research(
question: str,
max_rounds: int = 3,
queries_per_round: int = 3,
) -> dict:
"""Research a question through multiple rounds of search."""
all_findings = []
all_sources = []
search_history = []
for round_num in range(max_rounds):
# Generate search queries based on question + prior findings
query_prompt = f"""Given the research question and findings so far,
generate {queries_per_round} targeted search queries.
Research Question: {question}
Previous findings:
{chr(10).join(f'- {f}' for f in all_findings[-10:]) if all_findings else 'None yet'}
Previous queries (avoid repeating):
{chr(10).join(f'- {q}' for q in search_history)}
Return exactly {queries_per_round} search queries, one per line."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": query_prompt}],
temperature=0.7,
)
queries = [
q.strip().lstrip("0123456789.-) ")
for q in response.choices[0].message.content.strip().split("\n")
if q.strip()
][:queries_per_round]
search_history.extend(queries)
# Execute searches
for query in queries:
results = tavily.search(query=query, max_results=5)
for result in results.get("results", []):
all_sources.append({
"url": result["url"],
"title": result.get("title", ""),
"content": result["content"],
"query": query,
"round": round_num,
})
# Extract findings from this round
extraction_prompt = f"""Based on these search results, extract key findings
relevant to: {question}
Results:
{chr(10).join(r['content'][:500] for r in all_sources[-15:])}
List the most important findings as bullet points."""
extraction = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": extraction_prompt}],
)
round_findings = extraction.choices[0].message.content.strip()
all_findings.append(f"[Round {round_num + 1}] {round_findings}")
# Check if we have enough coverage
sufficiency_check = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": f"""Rate research completeness
for the question: {question}
Findings so far:
{chr(10).join(all_findings)}
Reply with SUFFICIENT or INSUFFICIENT and a brief explanation."""}],
)
if "SUFFICIENT" in sufficiency_check.choices[0].message.content:
break
return {
"findings": all_findings,
"sources": all_sources,
"rounds_completed": round_num + 1,
}
Pattern 2: Web Search Integration
Deep research agents need access to the live web — not just a pre-built vector store. Tavily provides a search API optimized for AI agents, returning cleaned content rather than raw HTML:
from tavily import TavilyClient
tavily = TavilyClient()
# Basic search
results = tavily.search(
query="latest advances in retrieval augmented generation 2025",
max_results=10,
search_depth="advanced", # More thorough crawling
include_raw_content=True, # Full page content
)
# Extract — returns cleaned page content for specific URLs
extract = tavily.extract(
urls=["https://arxiv.org/abs/2005.11401"]
)
# Research — full autonomous research workflow
research = tavily.research(
"What are the key differences between RLHF and DPO for LLM alignment?",
)
For hybrid research over both web and local documents, combine web search with a vector store retriever:
from langchain_core.tools import tool
from langchain_community.retrievers import TavilySearchAPIRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Web search tool
@tool
def web_search(query: str) -> str:
"""Search the web for current information on a topic."""
retriever = TavilySearchAPIRetriever(k=5)
docs = retriever.invoke(query)
return "\n\n".join(
f"Source: {d.metadata.get('source', 'unknown')}\n{d.page_content}"
for d in docs
)
# Local document search tool
vector_store = FAISS.load_local(
"./research_index",
OpenAIEmbeddings(),
allow_dangerous_deserialization=True, # required by recent LangChain versions for pickle-based indexes
)
@tool
def local_search(query: str) -> str:
"""Search internal documents and prior research reports."""
docs = vector_store.similarity_search(query, k=5)
return "\n\n".join(
f"Source: {d.metadata.get('source', 'unknown')}\n{d.page_content}"
for d in docs
)
Pattern 3: Self-Reflection and Gap Analysis
After each retrieval round, the agent evaluates what it has found and identifies what’s missing. This prevents premature report generation from incomplete evidence.
import json

def reflect_on_findings(
question: str,
findings: list[str],
research_brief: str,
) -> dict:
"""Evaluate research completeness and identify gaps."""
prompt = f"""You are a research quality evaluator. Assess whether the gathered
findings are sufficient to answer the research question comprehensively.
Research Brief: {research_brief}
Research Question: {question}
Findings:
{chr(10).join(findings)}
Evaluate:
1. COVERAGE: What percentage of the research brief is addressed? (0-100)
2. GAPS: What specific sub-topics or perspectives are missing?
3. CONTRADICTIONS: Are there conflicting claims that need resolution?
4. SOURCE_QUALITY: Are the sources authoritative and diverse?
5. VERDICT: SUFFICIENT or NEEDS_MORE_RESEARCH
Respond in JSON format."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
The reflection loop is what separates a deep research agent from a simple multi-step RAG pipeline. It closes the feedback loop: search → evaluate → decide → search again or proceed.
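The loop shape can be made explicit by passing the search and reflection steps in as callables. A minimal sketch; the function names and the VERDICT key are illustrative, following the reflection prompt's output format:

```python
def research_loop(question, search_fn, reflect_fn, max_rounds=3):
    """Drive the search -> evaluate -> decide cycle until coverage is sufficient.

    search_fn(question, findings) returns a list of new finding strings;
    reflect_fn(question, findings) returns a dict with a "VERDICT" key
    (SUFFICIENT or NEEDS_MORE_RESEARCH), as in reflect_on_findings above.
    """
    findings = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Search: each round sees the findings gathered so far
        findings.extend(search_fn(question, findings))
        # Evaluate and decide: stop once the verdict says coverage is sufficient
        if reflect_fn(question, findings).get("VERDICT") == "SUFFICIENT":
            break
    return findings, rounds
```

The same driver works whether the steps are raw API calls or full sub-agents; only the callables change.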
Pattern 4: Source Triangulation
Triangulation means verifying claims by finding them in multiple independent sources. This is how human researchers build confidence in findings and detect misinformation.
def triangulate_claims(claims: list[dict], sources: list[dict]) -> list[dict]:
"""Cross-reference claims against multiple sources."""
prompt = f"""You are a fact-checking agent. For each claim below, determine
how many of the provided sources support, contradict, or are neutral on it.
Claims:
{json.dumps(claims, indent=2)}
Sources:
{json.dumps([{"url": s["url"], "content": s["content"][:300]} for s in sources], indent=2)}
For each claim, respond with:
- claim: the original claim
- support_count: number of sources supporting it
- contradict_count: number of sources contradicting it
- confidence: HIGH / MEDIUM / LOW
- supporting_urls: list of URLs that support this claim
- notes: any important caveats
Respond as a JSON object with a "claims" key containing the array."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content).get("claims", [])
Pattern 5: Automated Report Generation
The final step transforms raw findings into a structured report with sections, citations, and a coherent narrative. The report is generated after all research is complete, using the full findings as context.
def generate_report(
question: str,
findings: list[str],
sources: list[dict],
triangulated_claims: list[dict],
) -> str:
"""Generate a structured research report from findings."""
# Deduplicate and format sources
unique_sources = {}
for s in sources:
if s["url"] not in unique_sources:
unique_sources[s["url"]] = s
numbered_sources = list(unique_sources.values())
prompt = f"""Write a comprehensive research report answering: {question}
## Instructions
- Use the findings and source material below.
- Structure the report with clear sections and subsections.
- Cite sources using numbered references [1], [2], etc.
- Flag claims with LOW confidence from triangulation.
- Include a "Sources" section at the end with numbered URLs.
- Be objective — present multiple perspectives where they exist.
## Research Findings
{chr(10).join(findings)}
## Triangulated Claims (confidence-rated)
{json.dumps(triangulated_claims, indent=2)}
## Available Sources
{chr(10).join(f'[{i+1}] {s["title"]} — {s["url"]}' for i, s in enumerate(numbered_sources))}
Write the report now."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
max_tokens=4096,
)
return response.choices[0].message.content
Deep Research Agent with LangGraph
LangGraph’s graph-based architecture is ideal for deep research — the workflow has clear phases (plan → research → reflect → write) with conditional loops.
Architecture
graph TD
A["User Query"] --> B["Scope &<br/>Brief Generation"]
B --> C["Research<br/>Supervisor"]
C --> D["Spawn<br/>Sub-Agents"]
D --> E1["Sub-Agent 1:<br/>Subtopic A"]
D --> E2["Sub-Agent 2:<br/>Subtopic B"]
D --> E3["Sub-Agent N:<br/>Subtopic N"]
E1 --> F["Collect &<br/>Clean Findings"]
E2 --> F
E3 --> F
F --> G{"Sufficient<br/>Coverage?"}
G -->|No| C
G -->|Yes| H["Write<br/>Report"]
H --> I["Final Report<br/>with Citations"]
style A fill:#4a90d9,color:#fff,stroke:#333
style C fill:#9b59b6,color:#fff,stroke:#333
style E1 fill:#e67e22,color:#fff,stroke:#333
style E2 fill:#e67e22,color:#fff,stroke:#333
style E3 fill:#e67e22,color:#fff,stroke:#333
style G fill:#f5a623,color:#fff,stroke:#333
style I fill:#27ae60,color:#fff,stroke:#333
This follows the three-phase architecture from LangChain’s Open Deep Research: Scope (clarify and create a research brief), Research (supervisor delegates to sub-agents), and Write (compile findings into a report).
State Definition
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
class ResearchState(TypedDict):
query: str # Original user query
research_brief: str # Structured research plan
sub_topics: list[str] # Decomposed sub-questions
findings: Annotated[list[dict], lambda x, y: x + y] # Accumulated findings
sources: Annotated[list[dict], lambda x, y: x + y] # All sources
reflection: dict # Gap analysis results
iteration: int # Current research round
max_iterations: int # Safety limit
report: str # Final output
Phase 1: Scope and Brief Generation
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def generate_brief(state: ResearchState) -> dict:
"""Convert user query into a structured research brief."""
response = llm.invoke(f"""Convert this research request into a structured brief.
Query: {state["query"]}
Create a research brief with:
1. OBJECTIVE: What specific question(s) must be answered
2. SCOPE: What's in scope and out of scope
3. SUB_TOPICS: 3-5 specific sub-questions to investigate
4. SUCCESS_CRITERIA: How to know when research is sufficient
5. OUTPUT_FORMAT: What the final report should look like
Be specific and actionable.""")
# Parse sub-topics from brief
sub_topics_response = llm.invoke(
f"""Extract just the sub-topic questions from this brief as a JSON array of strings:
{response.content}"""
)
import json
try:
sub_topics = json.loads(sub_topics_response.content)
except json.JSONDecodeError:
sub_topics = [state["query"]]
return {
"research_brief": response.content,
"sub_topics": sub_topics,
"iteration": 0,
}
Phase 2: Research with Sub-Agents
Each sub-topic gets its own research sub-agent with an isolated context window — a key architectural lesson from production deep research systems. This prevents context pollution between unrelated sub-topics.
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from tavily import TavilyClient
tavily = TavilyClient()
@tool
def web_search(query: str) -> str:
"""Search the web for information on a topic. Returns relevant content with source URLs."""
results = tavily.search(query=query, max_results=5, search_depth="advanced")
formatted = []
for r in results.get("results", []):
formatted.append(f"Source: {r['url']}\nTitle: {r.get('title', '')}\n{r['content']}")
return "\n\n---\n\n".join(formatted) if formatted else "No results found."
@tool
def extract_page(url: str) -> str:
"""Extract the full content of a web page for detailed analysis."""
result = tavily.extract(urls=[url])
if result.get("results"):
return result["results"][0].get("raw_content", "")[:5000]
return "Could not extract content from URL."
# Create a research sub-agent
research_sub_agent = create_react_agent(
model=ChatOpenAI(model="gpt-4o-mini", temperature=0),
tools=[web_search, extract_page],
prompt="""You are a focused research agent. Your job is to thoroughly
research ONE specific sub-topic. Search multiple times to get comprehensive
coverage. After researching, summarize your findings with specific citations.
Always cite your sources with URLs. Search at least 2-3 times with different
queries to get broad coverage.""",
)
def research_subtopics(state: ResearchState) -> dict:
"""Run research sub-agents for each sub-topic."""
new_findings = []
new_sources = []
for subtopic in state["sub_topics"]:
# Each sub-agent gets a clean context
result = research_sub_agent.invoke({
"messages": [{"role": "user", "content": f"Research this topic thoroughly: {subtopic}"}]
})
# Extract the final answer from the agent
final_msg = result["messages"][-1].content
new_findings.append({
"subtopic": subtopic,
"content": final_msg,
"iteration": state["iteration"],
})
# Extract source URLs from tool call results
for msg in result["messages"]:
if hasattr(msg, "content") and "Source: http" in str(msg.content):
for line in str(msg.content).split("\n"):
if line.startswith("Source: http"):
url = line.replace("Source: ", "").strip()
new_sources.append({
"url": url,
"subtopic": subtopic,
})
return {
"findings": new_findings,
"sources": new_sources,
"iteration": state["iteration"] + 1,
}
Phase 3: Reflection and Gap Analysis
def reflect_on_research(state: ResearchState) -> dict:
"""Evaluate research completeness and identify gaps."""
findings_text = "\n\n".join(
f"### {f['subtopic']}\n{f['content']}" for f in state["findings"]
)
response = llm.invoke(f"""Evaluate the research completeness.
Research Brief:
{state['research_brief']}
Findings So Far:
{findings_text}
Assess:
1. What percentage of the brief is covered? (0-100)
2. What specific gaps remain?
3. Are there contradictions that need resolution?
4. What follow-up queries would fill the gaps?
Respond in JSON with keys: coverage_pct, gaps, contradictions, follow_up_queries, verdict (SUFFICIENT or NEEDS_MORE)""")
import json
try:
reflection = json.loads(response.content)
except json.JSONDecodeError:
reflection = {"verdict": "SUFFICIENT", "coverage_pct": 80, "gaps": []}
# Update sub-topics with follow-up queries if more research needed
new_sub_topics = reflection.get("follow_up_queries", [])
return {
"reflection": reflection,
"sub_topics": new_sub_topics if new_sub_topics else state["sub_topics"],
}
def should_continue_research(state: ResearchState) -> str:
"""Decide whether to continue researching or write the report."""
reflection = state.get("reflection", {})
verdict = reflection.get("verdict", "SUFFICIENT")
if state["iteration"] >= state["max_iterations"]:
return "write_report"
if verdict == "NEEDS_MORE" and state.get("sub_topics"):
return "research"
return "write_report"Phase 4: Report Writing
def write_report(state: ResearchState) -> dict:
"""Compile findings into a structured report."""
findings_text = "\n\n".join(
f"### {f['subtopic']}\n{f['content']}" for f in state["findings"]
)
# Deduplicate sources
unique_sources = list({s["url"]: s for s in state["sources"]}.values())
sources_text = "\n".join(
f"[{i+1}] {s['url']}" for i, s in enumerate(unique_sources)
)
report_llm = ChatOpenAI(model="gpt-4o", temperature=0)
response = report_llm.invoke(f"""Write a comprehensive research report.
Research Brief:
{state['research_brief']}
Research Findings:
{findings_text}
Available Sources:
{sources_text}
Instructions:
- Write a well-structured report with sections and subsections
- Cite sources using [1], [2], etc. matching the source list above
- Present multiple perspectives where they exist
- Flag uncertain claims
- End with a numbered Sources section
- Aim for thoroughness and clarity""")
return {"report": response.content}Assembling the Graph
# Build the graph
graph = StateGraph(ResearchState)
graph.add_node("generate_brief", generate_brief)
graph.add_node("research", research_subtopics)
graph.add_node("reflect", reflect_on_research)
graph.add_node("write_report", write_report)
graph.set_entry_point("generate_brief")
graph.add_edge("generate_brief", "research")
graph.add_edge("research", "reflect")
graph.add_conditional_edges(
"reflect",
should_continue_research,
{"research": "research", "write_report": "write_report"},
)
graph.add_edge("write_report", END)
app = graph.compile()
# Run research
result = app.invoke({
"query": "Compare the approaches to AI safety across leading labs",
"max_iterations": 3,
"findings": [],
"sources": [],
})
print(result["report"])Streaming Research Progress
For real-time feedback during long-running research:
async for event in app.astream(
{
"query": "What are the key trends in LLM inference optimization?",
"max_iterations": 3,
"findings": [],
"sources": [],
},
stream_mode="updates",
):
for node_name, state_update in event.items():
if node_name == "research":
print(f"📚 Completed research round, found {len(state_update.get('findings', []))} topics")
elif node_name == "reflect":
reflection = state_update.get("reflection", {})
print(f"🔍 Coverage: {reflection.get('coverage_pct', '?')}%")
print(f" Gaps: {reflection.get('gaps', [])}")
elif node_name == "write_report":
print(f"📝 Report generated ({len(state_update.get('report', ''))} chars)")Deep Research Agent with LlamaIndex
LlamaIndex’s workflow system and built-in RAG tools make it well-suited for deep research over both local documents and web sources.
Research Agent with Query Engine Tools
from llama_index.llms.openai import OpenAI
from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.core.workflow import Context
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Build a local knowledge base
documents = SimpleDirectoryReader("./research_papers").load_data()
index = VectorStoreIndex.from_documents(documents)
local_engine = index.as_query_engine(similarity_top_k=10)
local_tool = QueryEngineTool.from_defaults(
query_engine=local_engine,
name="local_research",
description="Search local research papers and documents. "
"Use for finding specific technical details, prior research, "
"and internal knowledge.",
)
# Web search tool
def search_web(query: str) -> str:
"""Search the web for current information, news, and external sources."""
from tavily import TavilyClient
tavily = TavilyClient()
results = tavily.search(query=query, max_results=5, search_depth="advanced")
formatted = []
for r in results.get("results", []):
formatted.append(f"[{r.get('title', '')}]({r['url']}): {r['content']}")
return "\n\n".join(formatted) if formatted else "No results found."
web_tool = FunctionTool.from_defaults(fn=search_web)
# Reflection tool
def evaluate_findings(findings_summary: str, original_question: str) -> str:
"""Evaluate whether the current findings sufficiently answer the question.
Returns gaps and suggested follow-up queries."""
llm = OpenAI(model="gpt-4o-mini")
response = llm.complete(
f"""Evaluate these research findings for completeness:
Question: {original_question}
Findings: {findings_summary}
Identify:
1. Coverage gaps
2. Unsupported claims
3. Missing perspectives
4. Suggested follow-up searches
Be specific about what's missing."""
)
return str(response)
reflection_tool = FunctionTool.from_defaults(fn=evaluate_findings)
# Create the research agent
research_agent = ReActAgent(
tools=[local_tool, web_tool, reflection_tool],
llm=OpenAI(model="gpt-4o", temperature=0),
system_prompt="""You are a thorough research agent. For each question:
1. PLAN: Break the question into sub-topics
2. SEARCH: Use both local_research and search_web for each sub-topic
3. REFLECT: Use evaluate_findings to check completeness
4. ITERATE: Search again for any gaps identified
5. SYNTHESIZE: Compile a comprehensive answer with citations
Always search at least 3 times before reflecting. Always cite your sources.""",
)
ctx = Context(research_agent)
response = await research_agent.run(
"What are the state-of-the-art approaches to reducing LLM hallucination?",
ctx=ctx,
)
print(response)
Multi-Step Research Workflow
For more control over the research process, use a structured workflow:
from llama_index.llms.openai import OpenAI
from llama_index.core.tools import FunctionTool
from llama_index.core.agent.workflow import ReActAgent, AgentWorkflow
# Planner agent — decomposes the question
planner = ReActAgent(
name="planner",
description="Breaks research questions into sub-topics and creates a research plan.",
tools=[],
llm=OpenAI(model="gpt-4o-mini"),
system_prompt="""You are a research planner. Given a question, break it into
3-5 specific sub-questions that together would provide a comprehensive answer.
For each sub-question, suggest what type of source would be best (academic, web, internal docs).
When done, hand off to 'researcher' with your plan.""",
can_handoff_to=["researcher"],
)
# Researcher agent — executes searches
researcher = ReActAgent(
name="researcher",
description="Executes search queries and gathers evidence from multiple sources.",
tools=[web_tool, local_tool],
llm=OpenAI(model="gpt-4o-mini"),
system_prompt="""You are a research executor. Follow the research plan provided.
For each sub-question, search at least 2 different sources. Record all source URLs.
When research is complete, hand off to 'writer' with your findings.""",
can_handoff_to=["writer", "planner"],
)
# Writer agent — compiles the report
writer = ReActAgent(
name="writer",
description="Synthesizes research findings into a structured report with citations.",
tools=[reflection_tool],
llm=OpenAI(model="gpt-4o"),
system_prompt="""You are a research report writer. Given research findings:
1. Use evaluate_findings to check for gaps
2. If gaps exist, hand back to 'researcher' with specific follow-up questions
3. If sufficient, write a structured report with:
- Executive summary
- Detailed findings by topic
- Numbered source citations
- Conclusion""",
can_handoff_to=["researcher"],
)
# Create the multi-agent workflow
workflow = AgentWorkflow(
agents=[planner, researcher, writer],
root_agent="planner",
)
ctx = Context(workflow)
result = await workflow.run(
"What are the most effective techniques for reducing latency in LLM serving?",
ctx=ctx,
)
Open-Source Deep Research Systems
GPT Researcher
GPT Researcher is the most mature open-source deep research agent. Its architecture follows the Plan-and-Execute pattern:
graph TD
A["Research Query"] --> B["Planner Agent:<br/>Generate Sub-Questions"]
B --> C1["Crawler Agent 1"]
B --> C2["Crawler Agent 2"]
B --> C3["Crawler Agent N"]
C1 --> D["Summarize &<br/>Track Sources"]
C2 --> D
C3 --> D
D --> E["Filter &<br/>Aggregate"]
E --> F["Publisher:<br/>Generate Report"]
F --> G["Report<br/>(PDF/Docx/MD)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#9b59b6,color:#fff,stroke:#333
style C1 fill:#e67e22,color:#fff,stroke:#333
style C2 fill:#e67e22,color:#fff,stroke:#333
style C3 fill:#e67e22,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
Key design decisions:
- Parallel crawlers: Multiple agents search simultaneously for different sub-questions, then their findings are aggregated
- Source diversity: Scrapes 20+ sources per research task to minimize bias
- Deep research mode: Tree-like recursive exploration with configurable depth and breadth (~5 min, ~$0.40 per run with o3-mini)
from gpt_researcher import GPTResearcher
researcher = GPTResearcher(
query="What are the implications of multimodal LLMs for autonomous driving?",
report_type="research_report",
)
# Conduct research (iterative search + analysis)
research_result = await researcher.conduct_research()
# Generate the final report
report = await researcher.write_report()
print(report)
STORM: Synthesis Through Multi-Perspective QA
STORM (Stanford/Shao et al., 2024) takes a different approach: it simulates conversations between domain experts to generate Wikipedia-style articles. The process:
- Discover perspectives: Identify different angles on the topic (e.g., for “climate change”: scientist, economist, policymaker)
- Simulate interviews: Each perspective asks questions to a “topic expert” grounded in web sources
- Curate outline: Organize collected information into a structured outline
- Write article: Generate from the outline with proper citations
graph TD
A["Topic"] --> B["Discover<br/>Perspectives"]
B --> C1["Expert 1<br/>asks questions"]
B --> C2["Expert 2<br/>asks questions"]
B --> C3["Expert 3<br/>asks questions"]
C1 --> D["Topic Expert<br/>(grounded in web)"]
C2 --> D
C3 --> D
D --> E["Curate<br/>Outline"]
E --> F["Write<br/>Article"]
F --> G["Wikipedia-style<br/>Article"]
style A fill:#4a90d9,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
STORM’s key insight: multi-perspective questioning produces more comprehensive articles than single-perspective research. Evaluation showed STORM articles were rated 25% more organized and 10% broader in coverage compared to standard outline-driven RAG.
LangChain Open Deep Research
LangChain’s open-source implementation follows a three-phase approach with a research supervisor that orchestrates sub-agents:
| Phase | Purpose | Key Technique |
|---|---|---|
| Scope | Clarify query, generate brief | User clarification → research brief compression |
| Research | Gather evidence | Supervisor spawns parallel sub-agents with isolated contexts |
| Write | Produce final report | One-shot generation from brief + all findings |
Lessons from their production deployment:
- Multi-agent only for parallelizable research — writing in parallel produces disjoint reports; write in one shot after all research
- Context isolation — sub-agents with separate contexts avoid cross-contamination between unrelated subtopics
- Context engineering — compress chat history into briefs, prune raw tool outputs before returning to supervisor to avoid token bloat
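The context-engineering lesson can be sketched as a pruning helper that runs on raw tool output before it re-enters the supervisor's context. The heuristics below are illustrative assumptions, not taken from the Open Deep Research codebase:

```python
def compress_tool_output(raw: str, max_chars: int = 1500) -> str:
    """Prune a raw tool result before returning it to the supervisor.

    Keeps source attributions and substantive lines; drops short
    navigation/boilerplate fragments, then truncates to a char budget.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    kept = [l for l in lines if l.startswith("Source:") or len(l) > 40]
    return "\n".join(kept)[:max_chars]
```

In practice an LLM summarization pass gives better compression than line heuristics, at the cost of an extra call per tool result.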
Architecture Comparison
| System | Architecture | Search Strategy | Multi-Agent | Report Quality |
|---|---|---|---|---|
| OpenAI Deep Research | RL-trained agent with browser + Python | Reinforcement-learned browsing | Single agent | Highest — with embedded images, citations |
| GPT Researcher | Plan-and-Execute | Parallel crawlers, 20+ sources | Planner + crawler agents | Long-form with PDF/Docx export |
| STORM | Multi-perspective QA | Simulated expert interviews | Multiple “expert” personas | Wikipedia-style articles |
| LangChain Open Deep Research | Supervisor + sub-agents | Supervisor-delegated parallel search | Supervisor → N sub-agents | Brief-driven comprehensive reports |
| Custom LangGraph | State graph with reflection | Iterative with gap analysis | Configurable | Depends on implementation |
| Custom LlamaIndex | Workflow-based agent | ReAct with hybrid local + web | AgentWorkflow handoffs | Depends on implementation |
Production Considerations
Cost Management
Deep research is token-intensive. A single research task can consume 50K–200K tokens across planning, searching, reflecting, and writing:
| Operation | Approximate Tokens | Cost (GPT-4o-mini) |
|---|---|---|
| Brief generation | 2K–5K | ~$0.002 |
| Per sub-agent research round | 10K–30K | ~$0.01 |
| Reflection per round | 3K–8K | ~$0.003 |
| Report writing (GPT-4o) | 10K–20K | ~$0.10 |
| Total (3 rounds, 4 subtopics) | 80K–200K | $0.15–$0.50 |
Optimization strategies:
- Use `gpt-4o-mini` for planning, search, and reflection; `gpt-4o` only for the final report
- Compress sub-agent findings before returning to supervisor (remove raw HTML, irrelevant results)
- Cache search results to avoid redundant API calls
- Set token budgets per sub-agent and abort early if exceeded
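The caching strategy can be sketched as a small in-memory store keyed by a normalized query, so near-duplicate searches across reflection rounds hit the cache instead of the search API. `SearchCache` is a hypothetical class, and `fetch` stands in for whatever search function you use (e.g. a Tavily call):

```python
import hashlib

class SearchCache:
    """In-memory cache keyed by normalized query text. Repeated or
    near-duplicate searches across rounds return the cached results
    instead of spending another API call."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())  # case/whitespace-insensitive
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_fetch(self, query: str, fetch):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = fetch(query)  # only call the API on a miss
        return self._store[key]
```

A production version would add a TTL and persist across sessions, but even this sketch eliminates the redundant calls that iterative reflection loops tend to generate.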
Latency
Deep research takes minutes, not seconds. Set user expectations:
- 3–5 subtopics × 2–3 search rounds × 5–10 seconds per search = 30–150 seconds for research alone
- Parallelize sub-agent execution to reduce wall-clock time
- Stream progress updates (current subtopic, search round, gap analysis results) for real-time feedback
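The parallelization point can be sketched with `asyncio.gather`: launching all sub-agents concurrently makes wall-clock time roughly the slowest sub-agent rather than the sum of all of them. `run_subtopic` here is a placeholder for a real sub-agent loop:

```python
import asyncio

async def run_subtopic(subtopic: str) -> str:
    """Placeholder for one sub-agent's research loop; in a real agent
    this would issue search calls and reflection rounds."""
    await asyncio.sleep(0.1)  # stands in for network-bound search time
    return f"findings for {subtopic}"

async def research_all(subtopics: list[str]) -> list[str]:
    # Launch every sub-agent concurrently; results come back in the
    # same order as the input subtopics.
    return await asyncio.gather(*(run_subtopic(s) for s in subtopics))

results = asyncio.run(research_all(["pricing", "latency", "quality"]))
```

With three subtopics this completes in about 0.1 seconds instead of 0.3; the same shape applies whether the sub-agents are plain coroutines or LangGraph subgraphs.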
Quality Control
```python
import json

from openai import OpenAI

client = OpenAI()

def quality_gate(report: str, brief: str) -> dict:
    """Final quality check before delivering the report."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Evaluate this research report:
Brief: {brief}
Report: {report}
Check:
1. Does it address all points in the brief?
2. Are all claims cited with sources?
3. Are there any hallucinated facts (claims with no source)?
4. Is the structure clear and logical?
5. Score (1-10) for: completeness, accuracy, clarity
Respond in JSON."""}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Handling Contradictory Sources
Real-world research frequently encounters conflicting information. The agent should:
- Flag contradictions during reflection
- Present both sides in the report with source attribution
- Assess source authority — prefer primary sources, peer-reviewed papers, official documentation
- Note confidence levels — clearly mark uncertain claims
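One way to make the last two points concrete is a weighted confidence label per claim. The source-type weights below are illustrative assumptions, not a standard, and `claim_confidence` is a hypothetical helper; the idea is simply that the report writer receives a machine-readable label telling it which claims to hedge:

```python
# Hypothetical source-type weights: primary sources and peer-reviewed
# papers count for more than blogs when scoring a claim's support.
WEIGHTS = {"primary": 1.0, "peer_reviewed": 1.0, "official_docs": 0.8, "blog": 0.4}

def claim_confidence(supporting: list[str], contradicting: list[str]) -> str:
    """Label a claim from the weighted balance of source types for and
    against it, so the report writer can mark uncertain claims."""
    score_for = sum(WEIGHTS.get(t, 0.3) for t in supporting)
    score_against = sum(WEIGHTS.get(t, 0.3) for t in contradicting)
    if score_against > 0 and score_against >= score_for:
        return "contested"          # present both sides with attribution
    if score_for >= 1.5:
        return "well-supported"     # multiple independent strong sources
    return "uncertain"              # single or weak source; flag in report
```

The thresholds (1.5 for "well-supported") are tuning knobs; what matters is that the label travels with the claim into the report-writing prompt.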
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Context window overflow | Agent crashes on long research sessions | Compress findings per round; use sub-agents with isolated contexts |
| Search query repetition | Agent searches the same thing in every round | Track search history; enforce diversity in query generation |
| Source bias | Report only reflects one perspective | Explicitly prompt for multiple perspectives; verify source diversity |
| Infinite research loops | Agent never decides findings are sufficient | Set max iterations; implement coverage scoring with a concrete threshold |
| Citation hallucination | Sources cited in report don’t match actual URLs | Pass numbered source list to report writer; validate citations post-generation |
| Token cost explosion | Research costs dollars instead of cents | Budget per sub-agent; compress raw content; cache search results |
| Shallow sub-agent research | Sub-agents make one search and stop | Prompt sub-agents to search at least 2–3 times; require evidence from multiple sources |
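The citation-hallucination fix ("validate citations post-generation") can be sketched as a check that every `[n]` marker in the report maps to an entry in the numbered source list, assuming the report uses bracketed numeric citations:

```python
import re

def validate_citations(report: str, sources: list[str]) -> dict:
    """Check that every [n] citation in the report points at an entry in
    the numbered source list passed to the report writer, and report any
    sources that were never cited."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", report)}
    valid = set(range(1, len(sources) + 1))
    return {
        "dangling": sorted(cited - valid),   # cited numbers with no source
        "uncited": sorted(valid - cited),    # sources never referenced
        "ok": cited <= valid,                # True if no dangling citations
    }
```

Dangling citations trigger a regeneration pass or a manual review; uncited sources are usually harmless but can indicate the writer ignored part of the evidence.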
Conclusion
Deep research agents represent the natural evolution from single-pass RAG to autonomous investigation, and the core pattern is consistent across all the implementations surveyed here.
Key takeaways:
- Plan → Search → Reflect → Write is the fundamental loop. The reflection step — evaluating completeness and identifying gaps — is what makes research deep rather than just broad.
- Iterative retrieval outperforms single-shot retrieval by allowing later queries to be informed by earlier findings. Each round narrows the knowledge gap.
- Source triangulation catches misinformation by cross-referencing claims across multiple independent sources. Never trust a single source for important claims.
- Multi-agent architectures with isolated contexts scale better than single-agent approaches for multi-topic research. Use sub-agents for research; write the final report in one pass.
- Context engineering is critical — compress chat history into briefs, prune raw tool outputs, and set token budgets to avoid context window limits and cost explosion.
- LangGraph excels at building custom research workflows with explicit state management, conditional loops, and streaming progress updates. LlamaIndex shines when combining local RAG indices with web search in a ReAct agent.
Start with a simple iterative retrieval loop, verify it improves answer quality for your use case, then layer on reflection, triangulation, and multi-agent orchestration as needed.
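That starter loop can be sketched in a few lines. This is a minimal, dependency-free skeleton under stated assumptions: `search_fn` stands in for your search API and `reflect_fn` for a reflection prompt that returns the next gap-targeting query, or `None` when coverage is sufficient:

```python
def research_loop(question, search_fn, reflect_fn, max_rounds=3):
    """Minimal search-reflect loop: each round's query is informed by
    the gap the reflection step found in the accumulated findings."""
    findings, query = [], question
    for _ in range(max_rounds):
        findings.extend(search_fn(query))
        gap = reflect_fn(question, findings)  # None means coverage is sufficient
        if gap is None:
            break
        query = gap  # next round targets the identified gap
    return findings
```

Because the search and reflection steps are injected, you can unit-test the loop with stubs before wiring in a real search API and LLM, then layer triangulation and multi-agent orchestration on top.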
References
- OpenAI, Introducing Deep Research, February 2025. Blog
- LangChain, Open Deep Research, July 2025. Blog
- Shao et al., Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (STORM), NAACL 2024. arXiv:2402.14207
- Wu et al., Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools, ACL 2025. arXiv:2502.04644
- Elovic, GPT Researcher: Autonomous Agent for Comprehensive Online Research, 2024. GitHub
- Wang et al., Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning, ACL 2023. arXiv:2305.04091
- LangGraph Documentation, Open Deep Research Repository, 2025. GitHub
- Tavily Documentation, Research API Reference, 2026. Docs
Read More
- Build the foundational agent loop with Building a ReAct Agent from Scratch — the Thought-Action-Observation pattern that underpins deep research agents.
- Add tool calling and function calling for structured interactions with search APIs and retrieval tools.
- Design the state graph backbone with Building Agents with LangGraph — nodes, edges, and conditional routing.
- Orchestrate multiple research sub-agents using Multi-Agent RAG Orchestration Patterns.
- Add persistent context across research sessions with Memory Systems for Long-Running Retrieval Agents.
- Decompose complex research questions with Planning and Query Decomposition for Complex Retrieval.
- Ground your research agent in domain documents with Building a RAG Pipeline from Scratch and evaluate result quality with Evaluating RAG Systems.